Native Java Library
Prerequisites
pdf2Data Java SDK requires Java 8, Java 11 or Java 17 to be installed on your system.
We guarantee software compatibility with Oracle JRE 8 and OpenJDK 11/17.
We recommend at least 1.5 GB of Java heap space, plus 500 MB for each additional thread.
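The prerequisites above can be checked at startup. The following is a minimal sketch, not part of the pdf2Data SDK: the class name PreflightCheck, its helper methods and the thread count are hypothetical, and it assumes that "1.5 GB" means 1536 MB.

```java
class PreflightCheck {
    // Assumption: 1.5 GB is read as 1536 MB; 500 MB is taken as written.
    static final long BASE_HEAP_MB = 1536;
    static final long PER_EXTRA_THREAD_MB = 500;

    /** Recommended heap in MB for the given number of worker threads (at least 1). */
    static long recommendedHeapMb(int threads) {
        if (threads < 1) {
            throw new IllegalArgumentException("threads must be >= 1");
        }
        return BASE_HEAP_MB + PER_EXTRA_THREAD_MB * (threads - 1);
    }

    /** Parses the "java.specification.version" value: "1.8" -> 8, "11" -> 11, "17" -> 17. */
    static int majorVersion(String specVersion) {
        String v = specVersion.startsWith("1.") ? specVersion.substring(2) : specVersion;
        int dot = v.indexOf('.');
        return Integer.parseInt(dot < 0 ? v : v.substring(0, dot));
    }

    /** True for the Java versions listed in the prerequisites. */
    static boolean isListedJava(int major) {
        return major == 8 || major == 11 || major == 17;
    }

    public static void main(String[] args) {
        int threads = 4; // hypothetical number of worker threads
        int major = majorVersion(System.getProperty("java.specification.version"));
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        long neededMb = recommendedHeapMb(threads);

        System.out.println("Java " + major + (isListedJava(major) ? " is" : " is not")
                + " among the listed versions");
        System.out.println("Max heap: " + maxHeapMb + " MB, recommended: " + neededMb
                + " MB for " + threads + " threads");
        if (maxHeapMb < neededMb) {
            System.out.println("Consider raising the heap, for example with -Xmx" + neededMb + "m");
        }
    }
}
```

With these assumptions, four threads come out at 1536 + 3 × 500 = 3036 MB, which can be passed to the JVM through the standard -Xmx option.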
System Requirements
Recommended minimum hardware configuration:
- CPU: 2 cores
- Memory: 2 GB
- Temp storage: 2 GB free disk space
While the Java SDK will work fine on a single core, we recommend using multiple cores in cases where you handle documents in parallel using separate threads (one document per thread).
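One way to arrange "one document per thread" is a fixed-size thread pool from java.util.concurrent. The sketch below uses only the Java standard library; the class ParallelBatch, the method processAll and the placeholder task are hypothetical and stand in for whatever extraction call you run per file. This guide does not state whether a single extractor instance may be shared between threads, so the sketch leaves that decision to the task.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ParallelBatch {
    /**
     * Runs the task once per document on a fixed pool, so each pool thread
     * handles one document at a time. Results are returned in input order.
     */
    static <R> Map<Path, R> processAll(List<Path> pdfs, int threads, Function<Path, R> task) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (Path pdf : pdfs) {
                Callable<R> job = () -> task.apply(pdf);
                futures.add(pool.submit(job));
            }
            Map<Path, R> results = new LinkedHashMap<>();
            for (int i = 0; i < pdfs.size(); i++) {
                // get() blocks until that document has been processed
                results.put(pdfs.get(i), futures.get(i).get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a document", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A document task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Path> pdfs = Arrays.asList(Paths.get("a.pdf"), Paths.get("b.pdf"), Paths.get("c.pdf"));
        int threads = Math.min(pdfs.size(), Runtime.getRuntime().availableProcessors());
        // Placeholder task; a real one would run the extraction for that file.
        Map<Path, String> out = processAll(pdfs, threads, p -> "processed " + p.getFileName());
        out.forEach((p, r) -> System.out.println(p + " -> " + r));
    }
}
```

Capping the pool at the number of available cores keeps the thread count, and with it the heap needed per thread, predictable.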
Installation
The preferred way to set up iText pdf2Data in Java is to use a build system such as Maven or Gradle and download the pdf2Data artifacts from the iText Artifactory.
The groupId is com.itextpdf.pdf2data, and the artifactId is pdf2data.
In Maven, the configuration would look similar to the example below:
Maven
Add the pdf2Data repository to the <repositories> section.
<repositories>
    <repository>
        <id>pdf2Data</id>
        <name>pdf2Data Maven Repository</name>
        <url>https://repo.itextsupport.com/pdf2data</url>
    </repository>
    <repository> <!-- can be skipped if license is unlimited or local reporting is going to be configured -->
        <id>itext-releases</id>
        <name>iText Repository-releases</name>
        <url>https://repo.itextsupport.com/releases</url>
    </repository>
</repositories>
Then add the dependencies to the <dependencies> section.
<dependencies>
    <dependency>
        <groupId>com.itextpdf.pdf2data</groupId>
        <artifactId>pdf2data</artifactId>
        <version>4.5.0</version>
    </dependency>
    <dependency> <!-- can be skipped if license is unlimited or local reporting is going to be configured -->
        <groupId>com.itextpdf.licensing</groupId>
        <artifactId>licensing-remote</artifactId>
        <version>4.0.5</version>
    </dependency>
</dependencies>
Using pdf2Data from your code
As of pdf2Data 4.0, the format of extraction templates has changed compared to pdf2Data 3.*. See the Migration guide for details.
With the pdf2Data UI (pdf2Data 4.0+), you can download templates optimized for use in the pdf2Data SDK.
1. Load the pdf2Data license
Make sure to load the license file before invoking any other pdf2Data code.
LicenseKey.loadLicenseFile(pathToLicenseFile);
2. Create an extractor
A pdf2Data extractor is created from an extraction template downloaded from the pdf2Data UI.
The Pdf2DataExtractor instance is initialized from a processed template with a single function call:
Pdf2DataExtractor extractor = Pdf2DataExtractor.create(new File(P2D_TEMPLATE_PATH));
The extractor can be reused to process a batch of PDF files in a loop.
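The reuse pattern can be sketched as follows: the expensive object is created once, outside the loop, and each file is handled independently so that one unreadable PDF does not abort the batch. This is a hedged illustration using only the Java standard library; BatchLoop, Outcome and the recognize function are hypothetical placeholders for the extractor call, not SDK types.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class BatchLoop {
    /** Results for files that succeeded and error messages for files that failed. */
    static final class Outcome<R> {
        final Map<String, R> results = new LinkedHashMap<>();
        final Map<String, String> failures = new LinkedHashMap<>();
    }

    /** Applies the same recognize function to every path, isolating per-file failures. */
    static <R> Outcome<R> run(List<String> pdfPaths, Function<String, R> recognize) {
        Outcome<R> outcome = new Outcome<>();
        for (String path : pdfPaths) {
            try {
                outcome.results.put(path, recognize.apply(path));
            } catch (RuntimeException e) {
                outcome.failures.put(path, String.valueOf(e.getMessage()));
            }
        }
        return outcome;
    }

    public static void main(String[] args) {
        // Created once and captured by the lambda; stands in for the extractor.
        String template = "invoice-template";
        Function<String, String> recognize = path -> {
            if (path.startsWith("bad")) {
                throw new IllegalStateException("cannot read " + path);
            }
            return template + " applied to " + path;
        };

        Outcome<String> outcome = run(Arrays.asList("a.pdf", "bad.pdf", "c.pdf"), recognize);
        System.out.println("Succeeded: " + outcome.results.keySet());
        System.out.println("Failed: " + outcome.failures);
    }
}
```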
3. Extract data from PDF
RecognitionResultHolder result = extractor.recognizeOnPdf(new File(PDF_PATH));
You can use the extracted values directly from the result or save them in one of two structured formats (XML or JSON).
4. Get results for a specific data field
You can get all results as a sorted map by calling:
SortedMap<String, DataFieldResult> allResults = result.getDataFieldResults();
To get the results for a specific data field, use this call:
List<AbstractValueResult> dataFieldResult = allResults.get(DATAFIELD_NAME).getResults();
Result objects have a structure similar to the one described in the Recognition result specification; you can also consult the SDK JavaDocs.
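Because java.util.SortedMap.get returns null for a key that is not in the map, chaining a call directly onto the lookup throws a NullPointerException when the field name is absent. The sketch below shows a null-safe lookup. It is written with generic types so that it runs without the SDK; FieldLookup and valuesOf are hypothetical helper names, and whether the SDK omits keys for fields without matches is an assumption, not something this guide states.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.function.Function;

class FieldLookup {
    /** Returns the values for fieldName, or an empty list when the map has no such key. */
    static <F, V> List<V> valuesOf(SortedMap<String, F> allResults, String fieldName,
                                   Function<F, List<V>> getResults) {
        F field = allResults.get(fieldName);
        if (field == null) {
            return Collections.emptyList();
        }
        List<V> values = getResults.apply(field);
        return values == null ? Collections.<V>emptyList() : values;
    }

    public static void main(String[] args) {
        // Stand-in data: each "field" is simply its own list of values.
        SortedMap<String, List<String>> all = new TreeMap<>();
        all.put("invoiceNumber", Arrays.asList("INV-001"));

        List<String> found = valuesOf(all, "invoiceNumber", f -> f);
        List<String> missing = valuesOf(all, "missing", f -> f);
        System.out.println("found=" + found + ", missing=" + missing);
    }
}
```

With the SDK types from the snippet above, the equivalent call would be valuesOf(allResults, DATAFIELD_NAME, DataFieldResult::getResults).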
5. Save extracted data
By default, your data is saved without metadata. To include it in the result, use the method overloads that accept the following SerializationProperties:
SerializationProperties properties = new SerializationProperties().setIncludeMetaData(true);
XML
// write the result directly to a file
result.writeToXml(new File(RESULT_XML_PATH));
// write the result directly to an HTTP response
result.writeToXml(response.getOutputStream()); // any other OutputStream implementation can be passed here
To save the result with metadata:
// save to a file
result.writeToXml(new File(RESULT_XML_PATH), properties);
// write the result directly to an HTTP response
result.writeToXml(response.getOutputStream(), properties); // any other OutputStream implementation can be passed here
JSON
// write the result directly to a file
result.writeToJson(new File(RESULT_JSON_PATH));
// write the result directly to an HTTP response
result.writeToJson(response.getOutputStream()); // any other OutputStream implementation can be passed here
To save the result with metadata:
// save to a file
result.writeToJson(new File(RESULT_JSON_PATH), properties);
// write the result directly to an HTTP response
result.writeToJson(response.getOutputStream(), properties); // any other OutputStream implementation can be passed here
Full code sample
LicenseKey.loadLicenseFile(pathToLicenseFile);
Pdf2DataExtractor extractor = Pdf2DataExtractor.create(new File(P2D_TEMPLATE_PATH));
RecognitionResultHolder result = extractor.recognizeOnPdf(new File(PDF_PATH));
// write the results directly to files
result.writeToXml(new File(RESULT_XML_PATH));
result.writeToJson(new File(RESULT_JSON_PATH));
// write the results directly to an HTTP response
result.writeToXml(response.getOutputStream()); // any other OutputStream implementation can be passed here
result.writeToJson(response.getOutputStream());
// access the result objects directly, e.g. to save them to a database or other structured storage
// all results:
SortedMap<String, DataFieldResult> allResults = result.getResult().getDataFieldResults();
List<AbstractValueResult> dataFieldResult = allResults.get(DATAFIELD_NAME).getResults();
Deprecated API
Note that the functions used in the samples above were introduced in 4.4.0 and produce results in the new, refined format. Versions before 5.0.0 still contain the legacy API, which produces the old result format. The legacy API will be dropped in 5.0.0, so we recommend migrating to the new functions and format.